Information Retrieval from Historical Corpora

Authors

  • Loes Braun
  • Floris Wiesman
  • Ida Sprinkhuizen-Kuyper
Abstract

As more documents become available in digital form, the number of digital historical documents is also increasing (Berkvens, 2001). Standard IR systems cannot be assumed to perform well on historical documents, because historical texts differ from modern texts in three ways (Hüning, 1996; Van Der Horst and Marschall, 1989): (a) vocabularies have changed, (b) spelling has changed (sometimes to the extent that two variants of the same word are not even recognizable as such), and (c) spelling used to be highly inconsistent (in the Netherlands until the 19th century). The goals of this research were to identify the bottlenecks of information retrieval from historical corpora and to find solutions for them. Section 1 describes our efforts to further identify the bottlenecks, Section 2 examines potential solutions, and Section 3 sketches our approach to alleviating the bottlenecks. Finally, Section 4 draws conclusions and provides directions for future research.

Similar resources

Information Retrieval from Dutch Historical Corpora

Preface: Writing a thesis is often regarded as rather solitary labour. During my research I learned that the opposite is true, since writing this thesis would not have been possible without the support of several people whom I would like to acknowledge in this preface. First and foremost I would like to express my gratitude to the members of my thesis committee. I would like to thank Prof. dr. H....

Tagging Historical Corpora - the problem of spelling variation

Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use ‘standard’ or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes ...

Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations

While today's orthography is very strict and seldom changes, this has not always been true. In historical texts, the spelling of words often not only differs from today's but in some periods even varies from use to use within a single text. Information retrieval on historical corpora can deal with these variations using fuzzy matching techniques based on Levenshtein distance with stochastic weights. In pa...

Guideline: Multiple Hierarchies

As the title of the Dagstuhl Seminar "Digital Historical Corpora Architecture, Annotation, and Retrieval" already suggests, corpus architecture and corpus annotation are important topics for representing (historical) texts. In particular, the limitation of SGML-based markup languages to tree-structured annotations raises special problems when dealing with manuscripts: How is it possible to represe...

Speech Recognition and Information Retrieval: Experiments in Retrieving Spoken Documents

The Informedia Digital Video Library Project at Carnegie Mellon University is making large corpora of video and audio data available for full content retrieval by integrating natural language understanding, image processing, speech recognition and information retrieval. Information retrieval from corpora of speech recognition output is critical to the project's success. In this paper, we out...

Publication date: 2002